Offline Reinforcement Learning with Reverse Model-based Imagination
However, in many real-world applications, collecting sufficient exploratory interactions is usually impractical, because online data collection can be costly or even dangerous, such as in healthcare [4] and autonomous driving [5]. To address this challenge, offline RL [6, 7] develops a new learning paradigm that trains RL agents only with pre-collected offline datasets and thus can abstract away from the cost of online exploration [8-17].
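To make the paradigm concrete, here is a minimal sketch (not from the paper) of offline training: the agent replays transitions from a fixed, pre-collected dataset and never calls the environment. The dataset layout and all hyperparameters here are illustrative assumptions.

```python
import numpy as np

# Minimal tabular Q-learning on a fixed offline dataset (illustrative sketch,
# not the paper's method). The agent never interacts with the environment:
# it only replays pre-collected transitions (s, a, r, s').
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
# Hypothetical pre-collected dataset of transitions.
dataset = [(rng.integers(n_states), rng.integers(n_actions),
            rng.random(), rng.integers(n_states)) for _ in range(1000)]

Q = np.zeros((n_states, n_actions))
gamma, alpha = 0.99, 0.1
for _ in range(50):                   # epochs over the fixed dataset
    for s, a, r, s_next in dataset:   # note: no env.step() anywhere
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
```

Naively bootstrapping from actions that are rare or absent in the dataset causes extrapolation error, which is precisely the failure mode that dedicated offline RL algorithms are designed to constrain.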
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, An Yang, Jingren Zhou, Junyang Lin
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
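As a concrete illustration of the kind of token-level surrogate the abstract describes, below is a minimal sketch of a policy-gradient loss with importance-sampling correction and clipping, in the standard PPO-style form. The paper's exact objective and its Routing Replay mechanism are not reproduced here; the function name and signature are assumptions for illustration.

```python
import torch

def clipped_pg_loss(logp_new, logp_old, advantages, eps=0.2):
    """Token-level surrogate policy-gradient loss with importance-sampling
    correction and clipping (illustrative sketch; the paper's precise
    objective and hyperparameters may differ).

    logp_new:   log-probs of the sampled tokens under the current policy
    logp_old:   log-probs under the (possibly stale) behavior policy
    advantages: per-token (or broadcast sequence-level) advantage estimates
    """
    ratio = torch.exp(logp_new - logp_old)   # importance weights
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the elementwise min yields a pessimistic lower bound on the
    # surrogate, bounding the update when importance weights are stale.
    return -torch.min(unclipped, clipped).mean()
```

The min over the clipped and unclipped terms is what limits the damage from policy staleness: once the importance ratio drifts outside [1-eps, 1+eps], the gradient through that token is cut off.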
Lagrangian Relaxation for Multi-Action Partially Observable Restless Bandits: Heuristic Policies and Indexability
Partially observable restless multi-armed bandits have found numerous applications, including in recommendation systems, communication systems, public-healthcare outreach systems, and operations research. We study multi-action partially observable restless multi-armed bandits, a generalization of the classical restless multi-armed bandit problem: 1) each bandit has finitely many states, and the current state is not observable; 2) each bandit has finitely many actions. In particular, we assume that more than two actions are available for each bandit. We motivate our problem with the application of public-health intervention planning. We describe the model and formulate a long-term discounted optimization problem, where the state of each bandit evolves according to a Markov process and this evolution is action dependent. The state of a bandit is not observable, but one of finitely many feedback signals is observable. Each bandit yields a reward based on the action taken on that bandit. The agent is assumed to have a budget constraint. The bandits are assumed to be independent; however, they are weakly coupled at the agent through the budget constraint. We first analyze the Lagrangian bound method for our partially observable restless bandits, as sketched below. The computation of optimal value functions for finite-state, finite-action POMDPs is non-trivial, and hence the computation of Lagrangian bounds is also challenging. We describe approximations for computing Lagrangian bounds using point-based value iteration (PBVI) and an online rollout policy. We further present various properties of the value functions and provide theoretical insights on PBVI and the online rollout policy. We study heuristic policies for multi-action PORMABs. Finally, we present Whittle index policies and discuss their limitations in our model.
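As a simplified illustration of the Lagrangian bound the abstract discusses, the sketch below relaxes the budget constraint with a multiplier lam, which decouples the weakly coupled arms into independent per-arm problems with penalized rewards. For clarity it assumes fully observed per-arm states, whereas the paper's value functions live on belief states and are approximated via PBVI and rollout; all function and parameter names are illustrative.

```python
import numpy as np

def lagrangian_bound(P, R, cost, lam, budget, gamma=0.95, iters=500):
    """Upper bound on the weakly coupled problem via Lagrangian relaxation
    (illustrative sketch on fully observed states, not the paper's
    belief-state computation).

    P[i][a]: transition matrix of arm i under action a
    R[i]:    reward array of arm i, shape (n_states, n_actions)
    cost:    per-action cost array, shape (n_actions,)
    lam:     Lagrange multiplier (lam >= 0)
    budget:  per-step budget B

    Relaxing the budget decouples the arms: each arm is solved
    independently with penalized rewards R_i(s, a) - lam * cost[a].
    """
    total = 0.0
    for Pi, Ri in zip(P, R):
        n_s, n_a = Ri.shape
        V = np.zeros(n_s)
        for _ in range(iters):  # per-arm value iteration
            Qsa = Ri - lam * cost[None, :] + gamma * np.stack(
                [Pi[a] @ V for a in range(n_a)], axis=1)
            V = Qsa.max(axis=1)
        total += V.mean()  # e.g., uniform initial-state distribution
    return total + lam * budget / (1.0 - gamma)
```

Minimizing this bound over lam >= 0 gives the tightest Lagrangian upper bound; in the partially observable setting the same decoupling applies, except that each per-arm value iteration runs over beliefs and must itself be approximated, e.g., by PBVI.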